Sparse Matrix Multiplication with Bandwidth Restricted All-to-All Communication

Authors

  • Keren Censor-Hillel
  • Dean Leitersdorf
  • Elia Turner
Abstract

We show how to multiply two n × n matrices over semirings in the Congested Clique model, where n nodes synchronously communicate in an all-to-all manner using O(log n)-bit messages, within a round complexity that depends on the number of non-zero elements in the input matrices. By leveraging the sparsity of the input matrices, our algorithm reduces communication costs and thus improves upon the state of the art for matrices with o(n) non-zero elements. Moreover, our algorithm has the additional strength of surpassing previous solutions even when only one of the two matrices is sparse. In particular, this allows one to efficiently raise a sparse matrix to a power greater than 2. As applications, we show how to speed up the computation, on non-dense graphs, of 3- and 4-cycle counting, as well as of all-pairs shortest paths. Our algorithmic contribution is a new deterministic method of restructuring the input matrices in a sparsity-aware manner, which assigns each node element-wise multiplication tasks that are not necessarily consecutive but guarantee a balanced element distribution, providing for communication-efficient multiplication. As such, our technique may be useful in additional computational models.

∗Department of Computer Science, Technion. Email: [email protected]. Supported in part by ISF grant 1696/14.
†Department of Computer Science, Technion. Email: [email protected].
‡Department of Computer Science, Technion. Email: [email protected].

arXiv:1802.04789v2 [cs.DS] 14 Feb 2018
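The core idea described above (assigning each node element-wise multiplication tasks that are not necessarily consecutive but are balanced in number) can be illustrated with a minimal sketch. This is NOT the paper's actual Congested Clique algorithm; it is a hypothetical, centralized toy that only shows how the element-wise products arising from two sparse inputs might be dealt out evenly across nodes:

```python
# Illustrative sketch of sparsity-aware load balancing (a toy, not the
# paper's algorithm): enumerate the element-wise products a sparse
# semiring multiplication must perform, then deal them out round-robin
# so every node's task count differs by at most one.

def balanced_task_assignment(A, B, num_nodes):
    """A, B: dicts mapping (i, j) -> value, holding only non-zero entries.
    Returns one task list per node; each task ((i, k), (k, j)) means
    'multiply A[i, k] by B[k, j]' (contributing to entry (i, j))."""
    # Index B's non-zeros by row to find matching pairs quickly.
    b_cols_by_row = {}
    for (k, j) in B:
        b_cols_by_row.setdefault(k, []).append(j)

    # Every pair A[i, k], B[k, j] with both entries non-zero is one task.
    tasks = []
    for (i, k) in A:
        for j in b_cols_by_row.get(k, []):
            tasks.append(((i, k), (k, j)))

    # Round-robin assignment: not-necessarily-consecutive, but balanced.
    buckets = [[] for _ in range(num_nodes)]
    for t, task in enumerate(tasks):
        buckets[t % num_nodes].append(task)
    return buckets
```

In the actual model, each node would then fetch only the entries its tasks need, which is where the balanced distribution translates into balanced O(log n)-bit message traffic.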


Similar resources

Breaking the performance bottleneck of sparse matrix-vector multiplication on SIMD processors

The low utilization of SIMD units and memory bandwidth is the main performance bottleneck on SIMD processors for sparse matrix-vector multiplication (SpMV), which is one of the most important kernels in many scientific and engineering applications. This paper proposes a hybrid optimization method to break the performance bottleneck of SpMV on SIMD processors. The method includes a new sparse ma...
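For context on the SpMV kernel this snippet discusses, here is the textbook CSR (compressed sparse row) formulation of y = Ax. This is a plain scalar baseline, not the paper's SIMD-optimized method; the array names are the conventional CSR ones:

```python
# Textbook CSR sparse matrix-vector product: y = A @ x, where A is
# stored as three arrays:
#   values  - the non-zero entries, row by row
#   col_idx - the column index of each entry in `values`
#   row_ptr - row i's entries occupy values[row_ptr[i]:row_ptr[i+1]]

def csr_spmv(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[p] * x[col_idx[p]]
    return y
```

The irregular, data-dependent access to `x` via `col_idx` is precisely what makes this kernel hard to vectorize and memory-bandwidth bound, which is the bottleneck the cited paper targets.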


Reducing latency cost in 2D sparse matrix partitioning models

Sparse matrix partitioning is a common technique used for improving performance of parallel linear iterative solvers. Compared to solvers used for symmetric linear systems, solvers for nonsymmetric systems offer more potential for addressing different multiple communication metrics due to the flexibility of adopting different partitions on the input and output vectors of sparse matrix-vector mu...


Scalable Blas 2 and 3 Matrix Multiplication for Sparse Banded Matrices on Distributed Memory Mimd Machines

In this paper, we present two algorithms for sparse banded matrix-vector and sparse banded matrix-matrix product operations on distributed memory multiprocessor systems that support a mesh and ring interconnection topology. We also study the scalability of these two algorithms. We employ systolic-type techniques to eliminate synchronization delay and minimize the communication overhead among pr...


Reconfigurable Sparse Matrix-Vector Multiplication on FPGAs

executing memory-intensive simulations, such as those required for sparse matrix-vector multiplication. This effect is due to the memory bottleneck that is encountered with large arrays that must be stored in dynamic RAM. An FPGA core designed for a target performance that does not unnecessarily exceed the memory imposed bottleneck can be distributed, along with multiple memory interfaces, into...


Communication Avoiding (CA) and Other Innovative Algorithms

In 1981 Hong and Kung proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix-multiplication using the conventional O(n³) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it ...
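The Hong-Kung bound referenced in this snippet is usually stated as follows, where M denotes the size of the small, fast memory (the constant factors and the exact model vary between the cited proofs):

```latex
\[
  \#\text{words moved between fast and slow memory}
  \;=\; \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right)
\]
```

Communication-avoiding algorithms are those that attain this bound, i.e. move only O(n³/√M) words.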

متن کامل


Journal:
  • CoRR

Volume: abs/1802.04789  Issue: -

Pages: -

Publication date: 2018